I was talking with Dr. Kate Stange at the University of Colorado Boulder, who runs the very cool Experimental Mathematics Lab, about the project in the spring for undergraduates to visualize the online encyclopedia of integer sequences, or OEIS.
I was curious what data the students would even have to work with. On their website, the OEIS lists compressed versions available for download: one containing just the sequences and their A-numbers, and the other containing just the names of the sequences and their A-numbers. (The A-numbers are a canonical ID used in the OEIS).
Yet like the name of the database suggests, each integer sequence really has an entire encyclopedic entry associated with it, including often detailed comments, references to literature, links, and most interestingly, “crossrefs” to other sequences. For example, see https://oeis.org/search?q=id:A000032 (as well as the corresponding json api and oeis text api). This would all make a wonderful structured dataset, but I couldn’t find the data online, so I just downloaded them one at a time from the API. It took a few hours since I didn’t want to overload their servers.
Parsing out the cross-refs, I made this force-directed graph. Out of 317,727 sequences, only 233,727 had any in/out references at all. The other ones don’t appear on the graph. And in the graph you can see a giant central component: that one has 198,794 nodes.
I haven’t looked deeper into the structure yet, aside from identifying, well, that there is a giant blob in the middle, and that it all looks kind of cool. Someone asked me today what the protruding strand was from the blob and I don’t know yet either. Maybe some undergrads who do the experimental math program can tell me!
Some of the ideas I had for extensions were:
-
Generate number theoretic features from each sequence, then use them to do link prediction on the human-generated crossrefs. How well does it do under cross-validation? How does it compare to the OEIS Superseeker?
-
Embed the graph into hyperbolic space, which is an active area of research. How naturally does the OEIS crossrefs seem to resemble an emergently hyperbolic network?
To make the graph, I used graph-tool’s implementation of a spring force-directed graph, which scales at $O(V\log(V))$ with the number of vertices $V$. With 233,727 vertices, it took my ThinkPad somewhere between 3 and 8 hours – I let it run all night, with my laptop set to suspend after 5 hours.